Raw data

Dataset downloaded from Arkinglab website in the Transcriptome analysis reveals dysregulation of innate immune response genes and neuronal activity-dependent genes in autism section.

Load and annotate data

# Load csvs
datExpr = read.delim('./../Data/datExpr.csv')
datMeta = read.delim('./../Data/datPheno.csv')

# Create dataset with gene information
datGenes = data.frame('Ensembl_ID' = substr(datExpr$Gene, 1, 15), 
                      'gene_name' = substring(datExpr$Gene, 17))
rownames(datExpr) = datGenes$Ensembl_ID
datExpr$Gene = NULL


### CLEAN METADATA DATA FRAME
datMeta = datMeta %>% dplyr::select('ID', 'case', 'sampleid', 'brainregion', 'positiononplate', 
                                       'Gender', 'Age', 'SiteHM', 'RIN', 'PMI', 'Dx')
datMeta$brainregion = substr(datMeta$ID, 1, 4)
datMeta = datMeta %>% mutate(brain_lobe = ifelse(brainregion=='ba19', 'Occipital', 'Frontal'),
                             Diagnosis = ifelse(Dx=='Autism', 'ASD', 'CTL'))

# Convert Diagnosis variable to factor
datMeta$Diagnosis = factor(datMeta$Diagnosis, levels=c('CTL','ASD'))

# sampleid is actually subject ID
datMeta = datMeta %>% dplyr::rename(Subject_ID = sampleid)


# SFARI Genes
SFARI_genes = read_csv('./../../../PhD-Models/FirstPUModel/Data/SFARI/SFARI_genes_08-29-2019_with_ensembl_IDs.csv')
SFARI_genes = SFARI_genes[!duplicated(SFARI_genes$ID) & !is.na(SFARI_genes$ID),]

Check sample composition

Data description taken from the paper Transcriptome analysis reveals dysregulation of innate immune response genes and neuronal activity-dependent genes in autism:

Transcriptomes from 104 human brain cortical tissue samples were resolved using next-generation RNA sequencing technology at single-gene resolution and through co-expressing gene clusters or modules. Multiple cortical tissues corresponding to Brodmann Area 19 (BA19), Brodmann Area 10 (BA10) and Brodmann Area 44 (BA44) were sequenced in 62, 14 and 28 samples, respectively, resulting in a total of 57 (40 unique individuals) control and 47 (32 unique individuals) autism samples.

Note: They analysed all of the regions together

Brain tissue: Frozen brain samples were acquired through the Autism Tissue Program, with samples originating from two different sites: the Harvard Brain Tissue Resource Center and the NICHD Brain and Tissue Bank at the University of Maryland (Gandal’s data were obtained also from the Autism Tissue Program, specifically from the Harvard Brain Bank)

Sequenced using Illumina’s HiSeq 2000 (Gandal used Illumina HiSeq 2500) Check if they are compatible

print(paste0('Dataset includes ', nrow(datExpr), ' genes from ', ncol(datExpr), ' samples belonging to ', length(unique(datMeta$Subject_ID)), ' different subjects.'))
## [1] "Dataset includes 62069 genes from 120 samples belonging to 72 different subjects."

In the paper they talk about an original number of 110 samples and dropping 6 because of low gene coverage, resulting in 104 samples (which are the ones that are included in datMeta), but the expression dataset has 120 samples.

no_metadata_samples = colnames(datExpr)[! colnames(datExpr) %in% datMeta$ID]
no_metadata_subjects = unique(substring(no_metadata_samples, 6))

cat(paste0('Samples without metadata:  ', paste(no_metadata_samples, collapse=', '), '\n\n'))
## Samples without metadata:  ba10.s11, ba10.s12, ba10.s21, ba10.s24, ba10.s87, ba19.s13, ba19.s21, ba19.s54, ba19.s60, ba19.s87, ba44.s12, ba44.s21, ba44.s23, ba44.s24, ba44.s77, ba44.s87
cat(paste0('Samples without metadata but with subject ID in datMeta: ', 
             paste(no_metadata_subjects[no_metadata_subjects %in% datMeta$Subject_ID], collapse=', ')))
## Samples without metadata but with subject ID in datMeta: s11, s13, s60, s23

Since we need the metadata of the samples, I’m going to add the metadata of the samples that share a subject ID with some sample with metadata

add_metadata_subjects = no_metadata_subjects[no_metadata_subjects %in% datMeta$Subject_ID]
add_metadata_samples = no_metadata_samples[grepl(paste(add_metadata_subjects, collapse='|'),
                                                 no_metadata_samples)]

for(sample in add_metadata_samples){
  new_row = datMeta %>% filter(Subject_ID == strsplit(sample,'\\.')[[1]][2]) %>% dplyr::slice(1) %>%
            mutate(ID = sample, 
                   brainregion = strsplit(sample,'\\.')[[1]][1],
                   brain_lobe = ifelse(strsplit(sample,'\\.')[[1]][1]=='ba19','Occipital','Frontal'))
  datMeta = rbind(datMeta, new_row)
}

cat(paste0('Number of samples: ', nrow(datMeta)))
## Number of samples: 108
rm(no_metadata_subjects, no_metadata_samples, add_metadata_subjects, add_metadata_samples, sample, new_row)

And remove the samples that have no metadata and don’t have any other samples that do have metadata.

keep = substring(colnames(datExpr), 6) %in% datMeta$Subject_ID

cat(paste0('Removing ', sum(!keep) ,' samples (', paste(colnames(datExpr)[!keep], collapse=', '), ')\n\n'))
## Removing 12 samples (ba10.s12, ba10.s21, ba10.s24, ba10.s87, ba19.s21, ba19.s54, ba19.s87, ba44.s12, ba44.s21, ba44.s24, ba44.s77, ba44.s87)
cat(paste0('Belonging to subjects with IDs ', 
           paste0(unique(substring(colnames(datExpr)[!keep],6)), collapse=', '), '\n'))
## Belonging to subjects with IDs s12, s21, s24, s87, s54, s77
datExpr = datExpr[,keep]

# Match order of datExpr columns and datMeta rows
datMeta = datMeta[match(colnames(datExpr), datMeta$ID),]

# Check they are in the same order
if(!all(colnames(datExpr) == datMeta$ID)){
  cat('\norder of samples don\'t match between datExpr and datMeta!\n')
}

cat(paste0('Removed ', sum(!keep), ' samples, ', sum(keep), ' remaining'))
## Removed 12 samples, 108 remaining
rm(keep)


Diagnosis distribution: There are more CTL samples than controls, but it’s relatively balanced

cat('By Sample:')
## By Sample:
table(datMeta$Diagnosis)
## 
## CTL ASD 
##  58  50
cat('By Subject:')
## By Subject:
table(datMeta$Diagnosis[!duplicated(datMeta$Subject_ID)])
## 
## CTL ASD 
##  40  32


Brain region distribution: The Occipital lobe has more samples than the Frontal lobe, even though we are combining two brain regions in the Frontal Lobe

table(datMeta$brainregion)
## 
## ba10 ba19 ba44 
##   15   64   29
table(datMeta$brain_lobe)
## 
##   Frontal Occipital 
##        44        64


Most of the Control samples (66%) are from the Occipital lobe, the Autism samples are balanced. This may cause problems because Ctl and Occipital are related

table(datMeta$Diagnosis, datMeta$brain_lobe)
##      
##       Frontal Occipital
##   CTL      19        39
##   ASD      25        25
cat(paste0(round(100*sum(datMeta$Diagnosis=='CTL' & datMeta$brain_lobe=='Occipital')/sum(datMeta$brain_lobe=='Occipital')),
           '% of the Control samples are from the Occipital lobe\n'))
## 61% of the Control samples are from the Occipital lobe
cat(paste0(round(100*sum(datMeta$Diagnosis=='ASD' & datMeta$brain_lobe=='Occipital')/sum(datMeta$brain_lobe=='Occipital')),
           '% of the Autism samples are from the Occipital lobe'))
## 39% of the Autism samples are from the Occipital lobe


Gender distribution: There are thrice as many Male samples than Female ones

table(datMeta$Gender)
## 
##  F  M 
## 26 82


There is a small imbalance between gender and diagnosis with more males in the control group than in the autism group

table(datMeta$Diagnosis, datMeta$Gender)
##      
##        F  M
##   CTL 12 46
##   ASD 14 36
cat(paste0('\n',round(100*sum(datMeta$Diagnosis=='CTL' & datMeta$Gender=='M')/sum(datMeta$Diagnosis=='CTL')),
           '% of the Control samples are Male\n'))
## 
## 79% of the Control samples are Male
cat(paste0(round(100*sum(datMeta$Diagnosis=='ASD' & datMeta$Gender=='M')/sum(datMeta$Diagnosis=='ASD')),
           '% of the Autism samples are Male'))
## 72% of the Autism samples are Male


Age distribution: Subjects between 2 and 82 years old with a mean close to 20

Control samples are less evenly distributed across ages than Autism samples

summary(datMeta$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    8.75   18.00   20.49   22.00   82.00
datMeta_by_subject = datMeta %>% filter(!duplicated(Subject_ID))
datMeta_by_subject %>% ggplot(aes(Age)) +
                       geom_density(alpha=0.5, aes(group=Diagnosis, fill=Diagnosis), color='transparent') +
                       geom_density(alpha=0.5, fill='gray', color='transparent') +
                       theme_minimal()

rm(datMeta_by_subject)


Annotate genes with BioMart information

Cannot find 1580 ensembl ids using the archive that finds the largest number of genes is feb2014 (I cannot find the missing ones in any other archive version).


Filtering

df = SFARI_genes %>% dplyr::select(-gene_biotype) %>% inner_join(datGenes, by=c('ID'='ensembl_gene_id'))

cat(paste0('Considering all genes until now, this dataset contains ', length(unique(df$`gene-symbol`)),
             ' of the ', length(unique(SFARI_genes$`gene-symbol`)), ' SFARI genes\n\n'))
## Considering all genes until now, this dataset contains 976 of the 979 SFARI genes
cat(paste0('The missing genes are ',
             paste(SFARI_genes$`gene-symbol`[!SFARI_genes$`gene-symbol` %in% df$`gene-symbol`],
                   collapse=', '),'\n\n'))
## The missing genes are GRIN2B, MIR137, ZNF8
cat('Lost  genes\'s scores:')
## Lost  genes's scores:
table(SFARI_genes$`gene-score`[!SFARI_genes$`gene-symbol` %in% df$`gene-symbol`])
## 
## 1 3 5 
## 1 1 1
rm(df)


1. Filter genes with start or end position missing

to_keep = !is.na(datGenes$length)

datGenes = datGenes[to_keep,]
datExpr = datExpr[to_keep,]
rownames(datGenes) = datGenes$ensembl_gene_id

print(paste0('Removed ', sum(!to_keep), ' genes, ', sum(to_keep), ' remaining'))
## [1] "Removed 0 genes, 60489 remaining"


2. Filter genes that do not encode any protein

cat(paste0(round(100*mean(datGenes$gene_biotype=='protein_coding')),
           '% of genes are protein coding genes'))
## 37% of genes are protein coding genes
sort(table(datGenes$gene_biotype), decreasing=TRUE)
## 
##           protein_coding               pseudogene                  lincRNA 
##                    22363                    14637                     6557 
##                antisense                    miRNA                 misc_RNA 
##                     5047                     3280                     2151 
##                    snRNA                   snoRNA     processed_transcript 
##                     2032                     1518                      749 
##           sense_intronic                     rRNA        sense_overlapping 
##                      640                      560                      205 
##          IG_V_pseudogene                TR_V_gene                IG_V_gene 
##                      170                      150                      133 
##                TR_J_gene   polymorphic_pseudogene          TR_V_pseudogene 
##                       82                       49                       40 
##                IG_D_gene                  Mt_tRNA 3prime_overlapping_ncrna 
##                       27                       22                       18 
##                IG_J_gene                IG_C_gene          IG_C_pseudogene 
##                       18                       14                        8 
##                TR_C_gene          TR_J_pseudogene          IG_J_pseudogene 
##                        6                        4                        3 
##                TR_D_gene                  Mt_rRNA     processed_pseudogene 
##                        3                        2                        1

Most of the genes with low expression levels are not protein-coding

plot_data = data.frame('ID' = rownames(datExpr), 'MeanExpr' = apply(datExpr, 1, mean), 'ProteinCoding'=datGenes$gene_biotype=='protein_coding')

ggplotly(plot_data %>% ggplot(aes(log2(MeanExpr+1), fill=ProteinCoding, color=ProteinCoding)) + geom_density(alpha=0.5) + 
         theme_minimal())
rm(plot_data)

We only lose 4 genes with a SFARI score, but they all have low scores (4 and 5)

df = SFARI_genes %>% dplyr::select(-gene_biotype) %>% inner_join(datGenes, by=c('ID'='ensembl_gene_id'))

print(paste0('Filtering protein coding genes, we are left with ', length(unique(df$`gene-symbol`[df$gene_biotype=='protein_coding'])),
             ' SFARI genes'))
## [1] "Filtering protein coding genes, we are left with 972 SFARI genes"
kable(df %>% filter(! `gene-symbol` %in% df$`gene-symbol`[df$gene_biotype=='protein_coding']) %>% 
      dplyr::select(ID, `gene-symbol`, `gene-score`, gene_biotype, syndromic, `number-of-reports`), caption='Lost Genes')
Lost Genes
ID gene-symbol gene-score gene_biotype syndromic number-of-reports
ENSG00000204466 DGKK 5 processed_transcript 0 1
ENSG00000104725 NEFL 5 processed_transcript 0 2
ENSG00000197558 SSPO 4 processed_transcript 0 3
ENSG00000157152 SYN2 4 processed_transcript 0 6
rm(df)
if(!all(rownames(datExpr)==rownames(datGenes))) print('!!! gene rownames do not match!!!')

to_keep = datGenes$gene_biotype=='protein_coding'
datExpr = datExpr %>% filter(to_keep)
datGenes = datGenes %>% filter(to_keep)
rownames(datExpr) = datGenes$ensembl_gene_id
rownames(datGenes) = datGenes$ensembl_gene_id

print(paste0(length(unique(SFARI_genes$`gene-symbol`[SFARI_genes$ID %in% rownames(datExpr)])), ' SFARI genes remaining'))
## [1] "972 SFARI genes remaining"
print(paste0('Removed ', sum(!to_keep), ' genes, ', sum(to_keep), ' remaining'))
## [1] "Removed 38126 genes, 22363 remaining"


3. Filter genes with low expression levels

\(\qquad\) 3.1 Remove genes with zero expression in all of the samples

to_keep = rowSums(datExpr)>0
datGenes = datGenes[to_keep,]
datExpr = datExpr[to_keep,]


print(paste0(length(unique(SFARI_genes$`gene-symbol`[SFARI_genes$ID %in% rownames(datExpr)])), ' SFARI genes remaining'))
## [1] "968 SFARI genes remaining"
print(paste0('Removed ', sum(!to_keep), ' genes, ', sum(to_keep), ' remaining'))
## [1] "Removed 2811 genes, 19552 remaining"

\(\qquad\) 2.2 Removing genes with a mean expression lower than 5 (it’s quite high compared to the threshold of 1.7 used in Gandal’s dataset).

threshold = 5
plot_data = data.frame('id'=rownames(datExpr), 'mean_expression' = rowMeans(datExpr))

ggplotly(plot_data %>% ggplot(aes(x=mean_expression)) + 
         geom_density(color='#0099cc', fill='#0099cc', alpha=0.3) + 
         geom_vline(xintercept=threshold, color='gray') + scale_x_log10() + 
         ggtitle('gene Mean Expression distribution') + theme_minimal())
to_keep = rowMeans(datExpr)>threshold
datGenes = datGenes[to_keep,]
datExpr = datExpr[to_keep,]

print(paste0(length(unique(SFARI_genes$`gene-symbol`[SFARI_genes$ID %in% rownames(datExpr)])), ' SFARI genes remaining'))
## [1] "884 SFARI genes remaining"
print(paste0('Removed ', sum(!to_keep), ' genes, ', sum(to_keep), ' remaining'))
## [1] "Removed 4471 genes, 15081 remaining"
rm(threshold, plot_data)

Filter out outliers: Using node connectivity as a distance measure, normalising it and filtering out genes farther away than 2 standard deviations from the left (lower connectivity than average, not higher)

absadj = datExpr %>% bicor %>% abs
## alpha: 1.000000
netsummary = fundamentalNetworkConcepts(absadj)
ku = netsummary$Connectivity
z.ku = (ku-mean(ku))/sqrt(var(ku))

plot_data = data.frame('sample'=1:length(z.ku), 'distance'=z.ku, 'Sample_ID'=datMeta$ID, 
                       'Subject_ID'=datMeta$Subject_ID, 'Site'=datMeta$SiteHM,
                       'Brain_Lobe'=datMeta$brain_lobe, 'Sex'=datMeta$Gender, 'Age'=datMeta$Age,
                       'Diagnosis'=datMeta$Diagnosis, 'PMI'=as.numeric(datMeta$PMI))
selectable_scatter_plot(plot_data, plot_data[,-c(1,2)])
print(paste0('Outlier samples: ', paste(as.character(plot_data$Sample_ID[plot_data$distance< -2]), collapse=', ')))
## [1] "Outlier samples: ba19.s13, ba44.s23"
to_keep = z.ku >= -2
datMeta = datMeta[to_keep,]
datExpr = datExpr[,to_keep]

print(paste0('Removed ', sum(!to_keep), ' samples, ', sum(to_keep), ' remaining'))
## [1] "Removed 2 samples, 106 remaining"
rm(absadj, netsummary, ku, z.ku, plot_data, to_keep)
cat(paste0('After filtering, the dataset consists of ', nrow(datExpr), ' genes and ', ncol(datExpr), ' samples belonging to ', length(unique(datMeta$Subject_ID)),' subjects'))
## After filtering, the dataset consists of 15081 genes and 106 samples belonging to 72 subjects




Batch Effects

According to Tackling the widespread and critical impact of batch effects in high-throughput data, technical artifacts can be an important source of variability in the data, so batch correction should be part of the standard preprocessing pipeline of gene expression data.

They say Processing group and Date of the experiment are good batch surrogates, I only have processing group, so I’m going to see if this affects the data in any clear way to use it as a surrogate.

All the information we have is the Brain Bank (H/M), and although all the samples were obtained from the Autism Tissue Project, we don’t have any more specific information about who preprocessed each sample

table(datMeta$SiteHM)
## 
##  H  M 
## 49 57


There seems to be an important bias between the site that processed the samples and the objective variable, so the batch effect can be confused with the diagnosis effect.

table(datMeta$SiteHM, datMeta$Diagnosis)
##    
##     CTL ASD
##   H  13  36
##   M  45  12

Samples don’t seem to cluster together that strongly for each batch, although there does seem to be some kind of relation, but it could be due to diagnosis, not to batch (this is the problem with unbalanced diagnosis between batches!)

h_clusts = datExpr %>% t %>% dist %>% hclust %>% as.dendrogram

create_viridis_dict = function(){
  min_age = datMeta$Age %>% min
  max_age = datMeta$Age %>% max
  viridis_age_cols = viridis(max_age - min_age + 1)
  names(viridis_age_cols) = seq(min_age, max_age)
  
  return(viridis_age_cols)
}
viridis_age_cols = create_viridis_dict()

dend_meta = datMeta[match(labels(h_clusts), datMeta$ID),] %>% 
            mutate('Site' = ifelse(SiteHM=='H', '#F8766D', '#00BFC4'),
                   'Diagnosis' = ifelse(Diagnosis=='CTL','#008080','#86b300'), # Blue control, Green ASD
                   'Sex' = ifelse(Gender=='F','#ff6666','#008ae6'),            # Pink Female, Blue Male
                   'Region' = case_when(brain_lobe=='Frontal'~'#F8766D',        # ggplot defaults for 2 colours
                                        brain_lobe=='Occipital'~'#00BFC4'),
                   'Age' = viridis_age_cols[as.character(Age)]) %>%             # Purple: young, Yellow: old
            dplyr::select(Age, Region, Sex, Diagnosis, Site)
h_clusts %>% set('labels', rep('', nrow(datMeta))) %>% set('branches_k_color', k=9) %>% plot
colored_bars(colors=dend_meta)

rm(h_clusts, dend_meta, create_viridis_dict, viridis_age_cols)

Comparing the mean expression of each sample by batch we can see there is some batch effect differentiating them

plot_data_b1 = data.frame('Mean'=colMeans(datExpr[,datMeta$SiteHM=='H']), 'Batch'='H')
plot_data_b2 = data.frame('Mean'=colMeans(datExpr[,datMeta$SiteHM=='M']), 'Batch'='M')

plot_data = rbind(plot_data_b1, plot_data_b2)
mu = plot_data %>% group_by(Batch) %>% dplyr::summarise(BatchMean=mean(Mean))

ggplotly(plot_data %>% ggplot(aes(x=Mean, color=Batch, fill=Batch)) + geom_density(alpha=0.3) + 
         geom_vline(data=mu, aes(xintercept=BatchMean, color=Batch), linetype='dashed') +
         ggtitle('Mean expression by sample grouped by Batch') + scale_x_log10() + theme_minimal())
rm(plot_data_b1, plot_data_b2, plot_data, mu)


Looking for unknown sources of batch effects

Following the pipeline from Surrogate variable analysis: hidden batch effects where sva is used with DESeq2.

Create a DeseqDataSet object, estimate the library size correction and save the normalized counts matrix

counts = datExpr %>% as.matrix
rowRanges = GRanges(datGenes$chromosome_name,
                  IRanges(datGenes$start_position, width=datGenes$length),
                  strand=datGenes$strand,
                  feature_id=datGenes$ensembl_gene_id)
se = SummarizedExperiment(assays=SimpleList(counts=counts), rowRanges=rowRanges, colData=datMeta)
dds = DESeqDataSet(se, design =~Diagnosis)
## converting counts to integer mode
dds = estimateSizeFactors(dds)
norm.cts = counts(dds, normalized=TRUE)

Provide the normalized counts and two model matrices to SVA. The first matrix uses the biological condition, and the second model matrix is the null model.

mod = model.matrix(~ Diagnosis, colData(dds))
mod0 = model.matrix(~ 1, colData(dds))
sva_fit = svaseq(norm.cts, mod=mod, mod0=mod0)
## Number of significant surrogate variables is:  23 
## Iteration (out of 5 ):1  2  3  4  5
rm(mod, mod0, norm.cts)

Found 23 surrogate variables, since there is no direct way to select which ones to pick Bioconductor answer, decided to keep all of them.

Include SV estimations to datMeta information

sv_data = sva_fit$sv %>% data.frame
colnames(sv_data) = paste0('SV',1:ncol(sv_data))

datMeta_sva = cbind(datMeta, sv_data)

rm(sv_data, sva_fit)

In conclusion: Site could work as a surrogate for batch effects, but has the HUGE downside that is correlated to Diagnosis. The sva package found other 23 variables that could work as surrogates which are now included in datMeta and should be included in the DEA.


Normalisation and Differential Expression Analysis

Using DESeq2 package to perform normalisation. Chose this package over limma because limma uses the log transformed data as input instead of the raw counts and I have discovered that in this dataset, this transformation affects genes differently depending on their mean expression level, and genes with a high SFARI score are specially affected by this.

plot_data = data.frame('ID'=rownames(datExpr), 'Mean'=rowMeans(datExpr), 'SD'=apply(datExpr,1,sd))

plot_data %>% ggplot(aes(Mean, SD)) + geom_point(color='#0099cc', alpha=0.1) + geom_abline(color='gray') +
              scale_x_log10() + scale_y_log10() + theme_minimal()

rm(plot_data)
counts = datExpr %>% as.matrix
rowRanges = GRanges(datGenes$chromosome_name,
                  IRanges(datGenes$start_position, width=datGenes$length),
                  strand=datGenes$strand,
                  feature_id=datGenes$ensembl_gene_id)
se = SummarizedExperiment(assays=SimpleList(counts=counts), rowRanges=rowRanges, colData=datMeta_sva)
dds = DESeqDataSet(se, design = ~ SiteHM + SV1 + SV2 + SV3 + SV4 + SV5 + SV6 + SV7 + SV8 + SV9 + 
                                  SV10 + SV11 + SV12 + SV13 + SV14 + SV15 + SV16 + SV17 + SV18 +
                                  SV19 + SV20 + SV21 + SV22 + SV23 + Diagnosis)
## converting counts to integer mode
# Perform DEA
#dds = DESeq(dds) # Changed this for the three lines below because some rows don't converge
dds = estimateSizeFactors(dds)
dds = estimateDispersions(dds)
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
dds = nbinomWaldTest(dds, maxit=10000)
## 191 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest
DE_info = results(dds, lfcThreshold=0, altHypothesis='greaterAbs')

# Perform vst
vsd = vst(dds)

datExpr_vst = assay(vsd)
datMeta_vst = colData(vsd)
datGenes_vst = rowRanges(vsd)

rm(counts, rowRanges, se, vsd)

Using the plotting function DESEq2’s manual proposes to study vst’s output it looks like the data could be homoscedastic

meanSdPlot(datExpr_vst, plot=FALSE)$gg + theme_minimal()

When plotting point by point it seems like the genes with the lowest values behave differently

plot_data = data.frame('ID'=rownames(datExpr_vst), 'Mean'=rowMeans(datExpr_vst), 'SD'=apply(datExpr_vst,1,sd))

plot_data %>% ggplot(aes(Mean, SD)) + geom_point(color='#0099cc', alpha=0.2) + 
              scale_x_log10() + scale_y_log10() + theme_minimal()

rm(plot_data)





Save filtered and annotated dataset

*Could have done this since before

save(datExpr, datMeta, datGenes, file='./../Data/filtered_raw_data.RData')
#load('./../Data/Gandal/filtered_raw_data.RData')

Rename normalised datasets to continue working with these

datExpr = datExpr_vst
datMeta = datMeta_vst %>% data.frame
datGenes = datGenes_vst

print(paste0(length(unique(SFARI_genes$`gene-symbol`[SFARI_genes$ID %in% rownames(datExpr)])), ' SFARI genes remaining'))
## [1] "884 SFARI genes remaining"
print(paste0('After filtering, the dataset consists of ', nrow(datExpr), ' genes and ', ncol(datExpr), ' samples'))
## [1] "After filtering, the dataset consists of 15081 genes and 106 samples"
rm(datExpr_vst, datMeta_vst, datGenes_vst, datMeta_sva)




Batch Effect Correction

By including the surrogate variables in the DESeq formula we only modelled the batch effects into the DEA, but we didn’t actually correct them from the data, for that we need to use ComBat (or other equivalent package) in the already normalised data

SVA surrogate variables

In some places they say you shouldn’t correct these effects on the data because you risk losing biological variation, in others they say you should because they introduce noise to the data. The only thing everyone agrees on is that you shouldn’t remove them before performing DEA but instead include them in the model.

Based on the conclusions from Practical impacts of genomic data “cleaning” on biological discovery using surrogate variable analysis it seems like it may be a good idea to remove the batch effects from the data and not only from the DE analysis:

  • Using SVA, ComBat or related tools can increase the power to identify specific signals in complex genomic datasets (they found “greatly sharpened global and gene-specific differential expression across treatment groups”)

  • But caution should be exercised to avoid removing biological signal of interest

  • We must be precise and deliberate in the design and analysis of experiments and the resulting data, and also mindful of the limitations we impose with our own perspective

  • Open data exploration is not possible after such supervised “cleaning”, because effects beyond those stipulated by the researcher may have been removed

Comparing data with and without surrogate variable correction

# Taken from https://www.biostars.org/p/121489/#121500
correctDatExpr = function(datExpr, mod, svs) {
  X = cbind(mod, svs)
  Hat = solve(t(X) %*% X) %*% t(X)
  beta = (Hat %*% t(datExpr))
  rm(Hat)
  gc()
  P = ncol(mod)
  return(datExpr - t(as.matrix(X[,-c(1:P)]) %*% beta[-c(1:P),]))
}

pca_samples_before = datExpr %>% t %>% prcomp
pca_genes_before = datExpr %>% prcomp

# Correct
mod = model.matrix(~ Diagnosis, colData(dds))
svs = datMeta %>% dplyr::select(SV1:SV23) %>% as.matrix
datExpr_corrected = correctDatExpr(as.matrix(datExpr), mod, svs)

pca_samples_after = datExpr_corrected %>% t %>% prcomp
pca_genes_after = datExpr_corrected %>% prcomp

rm(correctDatExpr)

Samples

Removing batch effects has a big impact in the distribution of the samples, separating them by diagnosis pretty well just using the first principal component (although the separation is not as good as with the Gandal dataset)

pca_samples_df = rbind(data.frame('ID'=colnames(datExpr), 'PC1'=pca_samples_before$x[,1],
                                  'PC2'=pca_samples_before$x[,2], 'corrected'=0),
                       data.frame('ID'=colnames(datExpr), 'PC1'=-pca_samples_after$x[,1],
                                  'PC2'=-pca_samples_after$x[,2], 'corrected'=1)) %>%
                 left_join(datMeta %>% mutate('ID'=rownames(datMeta)), by='ID')

ggplotly(pca_samples_df %>% ggplot(aes(PC1, PC2, color=Diagnosis)) + geom_point(aes(frame=corrected, id=ID), alpha=0.75) + 
         xlab(paste0('PC1 (corr=', round(cor(pca_samples_before$x[,1],pca_samples_after$x[,1]),2),
                     '). % Var explained: ', round(100*summary(pca_samples_before)$importance[2,1],1),' to ',
                     round(100*summary(pca_samples_after)$importance[2,1],1))) +
         ylab(paste0('PC2 (corr=', round(cor(pca_samples_before$x[,2],pca_samples_after$x[,2]),2),
                     '). % Var explained: ', round(100*summary(pca_samples_before)$importance[2,2],1),' to ',
                     round(100*summary(pca_samples_after)$importance[2,2],1))) +
         ggtitle('Samples') + theme_minimal())
rm(pca_samples_df)


Genes

It seems like the sva correction preserves the mean expression of the genes and erases almost everything else (although what little else remains is enough to characterise the two Diagnosis groups relatively well using only the first PC)

*Plot is done with only 10% of the genes because it was too heavy otherwise

pca_genes_df = rbind(data.frame('ID'=rownames(datExpr), 'PC1'=pca_genes_before$x[,1],
                                'PC2'=pca_genes_before$x[,2], 'corrected'=0, 'MeanExpr'=rowMeans(datExpr)),
                     data.frame('ID'=rownames(datExpr), 'PC1'=-pca_genes_after$x[,1],
                                'PC2'=-pca_genes_after$x[,2], 'corrected'=1, 'MeanExpr'=rowMeans(datExpr)))

keep_genes = rownames(datExpr) %>% sample(0.1*nrow(datExpr))

pca_genes_df = pca_genes_df %>% filter(ID %in% keep_genes)

ggplotly(pca_genes_df %>% ggplot(aes(PC1, PC2,color=MeanExpr)) + geom_point(alpha=0.3, aes(frame=corrected, id=ID)) +
         xlab(paste0('PC1 (corr=', round(cor(pca_genes_before$x[,1],pca_genes_after$x[,1]),2),
                     '). % Var explained: ', round(100*summary(pca_genes_before)$importance[2,1],1),' to ',
                     round(100*summary(pca_genes_after)$importance[2,1],1))) +
         ylab(paste0('PC2 (corr=', round(cor(pca_genes_before$x[,2],pca_genes_after$x[,2]),2),
                     '). % Var explained: ', round(100*summary(pca_genes_before)$importance[2,2],1),' to ',
                     round(100*summary(pca_genes_after)$importance[2,2],1))) +
         scale_color_viridis() + ggtitle('Genes') + theme_minimal())
rm(pca_samples_before, pca_genes_before, mod, svs, pca_samples_after, pca_genes_after, pca_genes_df, keep_genes)

Decided to keep the corrected expression dataset

datExpr = datExpr_corrected

rm(datExpr_corrected)


Processing site

Even after correcting the dataset for the surrogate variables found with sva, there is still a difference in mean expression by processing site. The problem is that processing site is correlated with Diagnosis, so I don’t know if by correcting it I would be erasing relevant information related to ASD… I have to read more about this

plot_data_b1 = data.frame('Mean'=colMeans(datExpr[,datMeta$SiteHM=='H']), 'Batch'='H')
plot_data_b2 = data.frame('Mean'=colMeans(datExpr[,datMeta$SiteHM=='M']), 'Batch'='M')

plot_data = rbind(plot_data_b1, plot_data_b2)
mu = plot_data %>% group_by(Batch) %>% dplyr::summarise(BatchMean=mean(Mean))

ggplotly(plot_data %>% ggplot(aes(x=Mean, color=Batch, fill=Batch)) + geom_density(alpha=0.3) + 
         geom_vline(data=mu, aes(xintercept=BatchMean, color=Batch), linetype='dashed') +
         ggtitle('Mean expression by sample grouped by processing date') + scale_x_log10() + theme_minimal())
rm(plot_data_b1, plot_data_b2, plot_data, mu)


Performing Batch Correction for processing site

I will save the batch corrected dataset as a different dataset because of the correlation between processing site and diagnosis

https://support.bioconductor.org/p/50983/

datExpr_combat = datExpr %>% as.matrix %>% ComBat(batch=datMeta$SiteHM)
## Found2batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data

Now both batches have almost the same mean expression (we’d have to see what effect this has on the Diagnosis variable)

plot_data_b1 = data.frame('Mean'=colMeans(datExpr_combat[,datMeta$SiteHM=='H']), 'Batch'='H')
plot_data_b2 = data.frame('Mean'=colMeans(datExpr_combat[,datMeta$SiteHM=='M']), 'Batch'='M')

plot_data = rbind(plot_data_b1, plot_data_b2)
mu = plot_data %>% group_by(Batch) %>% dplyr::summarise(BatchMean=mean(Mean))

ggplotly(plot_data %>% ggplot(aes(x=Mean, color=Batch, fill=Batch)) + geom_density(alpha=0.3) + 
         geom_vline(data=mu, aes(xintercept=BatchMean, color=Batch), linetype='dashed') +
         ggtitle('Mean expression by sample grouped by processing date') + scale_x_log10() + theme_minimal())
rm(plot_data_b1, plot_data_b2, plot_data, mu)



Save preprocessed datasets with and without ComBat correction

save(datExpr, datMeta, datGenes, DE_info, dds, file='./../Data/preprocessed_data.RData')
save(datExpr_combat, datMeta, datGenes, DE_info, dds, file='./../Data/preprocessed_data_ComBat.RData')
#load('./../Data/Gandal/preprocessed_data.RData')




Session info

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: Scientific Linux 7.6 (Nitrogen)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] knitr_1.24                  dendextend_1.13.2          
##  [3] vsn_3.54.0                  WGCNA_1.68                 
##  [5] fastcluster_1.1.25          dynamicTreeCut_1.63-1      
##  [7] sva_3.34.0                  genefilter_1.68.0          
##  [9] mgcv_1.8-28                 nlme_3.1-139               
## [11] DESeq2_1.26.0               SummarizedExperiment_1.16.1
## [13] DelayedArray_0.12.2         BiocParallel_1.20.1        
## [15] matrixStats_0.55.0          Biobase_2.46.0             
## [17] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
## [19] IRanges_2.20.2              S4Vectors_0.24.2           
## [21] BiocGenerics_0.32.0         biomaRt_2.42.0             
## [23] ggExtra_0.9                 GGally_1.4.0               
## [25] gridExtra_2.3               viridis_0.5.1              
## [27] viridisLite_0.3.0           RColorBrewer_1.1-2         
## [29] plotlyutils_0.0.0.9000      plotly_4.9.1               
## [31] glue_1.3.1                  reshape2_1.4.3             
## [33] forcats_0.4.0               stringr_1.4.0              
## [35] dplyr_0.8.3                 purrr_0.3.3                
## [37] readr_1.3.1                 tidyr_1.0.0                
## [39] tibble_2.1.3                ggplot2_3.2.1              
## [41] tidyverse_1.3.0            
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1           backports_1.1.5        Hmisc_4.2-0           
##   [4] BiocFileCache_1.10.2   plyr_1.8.5             lazyeval_0.2.2        
##   [7] splines_3.6.0          crosstalk_1.0.0        robust_0.4-18.2       
##  [10] digest_0.6.23          foreach_1.4.7          htmltools_0.4.0       
##  [13] GO.db_3.10.0           fansi_0.4.1            magrittr_1.5          
##  [16] checkmate_1.9.4        memoise_1.1.0          fit.models_0.5-14     
##  [19] doParallel_1.0.15      cluster_2.0.8          limma_3.42.0          
##  [22] annotate_1.64.0        modelr_0.1.5           askpass_1.1           
##  [25] prettyunits_1.0.2      colorspace_1.4-1       rrcov_1.4-7           
##  [28] blob_1.2.0             rvest_0.3.5            rappdirs_0.3.1        
##  [31] haven_2.2.0            xfun_0.8               hexbin_1.28.0         
##  [34] crayon_1.3.4           RCurl_1.95-4.12        jsonlite_1.6          
##  [37] impute_1.60.0          zeallot_0.1.0          survival_2.44-1.1     
##  [40] iterators_1.0.12       gtable_0.3.0           zlibbioc_1.32.0       
##  [43] XVector_0.26.0         DEoptimR_1.0-8         scales_1.1.0          
##  [46] mvtnorm_1.0-11         DBI_1.1.0              miniUI_0.1.1.1        
##  [49] Rcpp_1.0.3             xtable_1.8-4           progress_1.2.2        
##  [52] htmlTable_1.13.1       foreign_0.8-71         bit_1.1-15.1          
##  [55] preprocessCore_1.48.0  Formula_1.2-3          htmlwidgets_1.5.1     
##  [58] httr_1.4.1             ellipsis_0.3.0         acepack_1.4.1         
##  [61] farver_2.0.3           pkgconfig_2.0.3        reshape_0.8.8         
##  [64] XML_3.98-1.20          nnet_7.3-12            dbplyr_1.4.2          
##  [67] locfit_1.5-9.1         labeling_0.3           tidyselect_0.2.5      
##  [70] rlang_0.4.2            later_1.0.0            AnnotationDbi_1.48.0  
##  [73] munsell_0.5.0          cellranger_1.1.0       tools_3.6.0           
##  [76] cli_2.0.1              generics_0.0.2         RSQLite_2.2.0         
##  [79] broom_0.5.3            evaluate_0.14          fastmap_1.0.1         
##  [82] yaml_2.2.0             bit64_0.9-7            fs_1.3.1              
##  [85] robustbase_0.93-5      mime_0.8               xml2_1.2.2            
##  [88] compiler_3.6.0         rstudioapi_0.10        curl_4.3              
##  [91] affyio_1.56.0          reprex_0.3.0           geneplotter_1.64.0    
##  [94] pcaPP_1.9-73           stringi_1.4.5          highr_0.8             
##  [97] lattice_0.20-38        Matrix_1.2-17          vctrs_0.2.1           
## [100] pillar_1.4.3           lifecycle_0.1.0        BiocManager_1.30.10   
## [103] data.table_1.12.8      bitops_1.0-6           httpuv_1.5.2          
## [106] affy_1.64.0            R6_2.4.1               latticeExtra_0.6-28   
## [109] promises_1.1.0         codetools_0.2-16       MASS_7.3-51.4         
## [112] assertthat_0.2.1       openssl_1.4.1          withr_2.1.2           
## [115] GenomeInfoDbData_1.2.2 hms_0.5.3              grid_3.6.0            
## [118] rpart_4.1-15           rmarkdown_1.14         Cairo_1.5-10          
## [121] shiny_1.4.0            lubridate_1.7.4        base64enc_0.1-3